SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-22
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this week, you will be able to:
ggplot2
TSwD
ROS
The Golden Rule
Summary statistics alone can be misleading. Always visualise your data before drawing conclusions.
“A world turning to a saner and richer civilisation will be a world turning to charts.” — Karsten (1923)
Consider these four datasets with nearly identical summary statistics:
| Dataset | x mean | x sd | y mean | y sd |
|---|---|---|---|---|
| dino | 54.3 | 16.8 | 47.8 | 26.9 |
| away | 54.3 | 16.8 | 47.8 | 26.9 |
| star | 54.3 | 16.8 | 47.8 | 26.9 |
| bullseye | 54.3 | 16.8 | 47.8 | 26.9 |
What do they actually look like?
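One way to answer this is to plot the four datasets side by side. The sketch below assumes the data come from the `datasauRus` package (its `datasaurus_dozen` data frame has `dataset`, `x`, and `y` columns); the package choice and the object names are assumptions, not taken from the slides.

```r
# Assumes the datasauRus package is installed: install.packages("datasauRus")
library(ggplot2)
library(datasauRus)

# Keep only the four datasets listed in the table
four <- subset(datasaurus_dozen,
               dataset %in% c("dino", "away", "star", "bullseye"))

# Means and standard deviations per dataset: nearly identical
aggregate(cbind(x, y) ~ dataset, data = four, FUN = mean)
aggregate(cbind(x, y) ~ dataset, data = four, FUN = sd)

# One scatterplot panel per dataset shows how different the shapes are
p_datasaurus <- ggplot(four, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~dataset)
p_datasaurus
```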
The same lesson from the statistician Frank Anscombe:
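Anscombe's quartet ships with base R as the `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). A minimal sketch of the comparison, with the reshaping step and the names `anscombe_long` and `p_anscombe` chosen here for illustration:

```r
library(ggplot2)

# Reshape to long form: one row per point, labelled by set
anscombe_long <- data.frame(
  set = rep(paste("Set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[, c("x1", "x2", "x3", "x4")], use.names = FALSE),
  y   = unlist(anscombe[, c("y1", "y2", "y3", "y4")], use.names = FALSE)
)

# The means of x and y are almost the same in every set
aggregate(cbind(x, y) ~ set, data = anscombe_long, FUN = mean)

# The fitted lines look alike, but the point patterns do not
p_anscombe <- ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~set)
p_anscombe
```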
Key Insight from ROS
Every graph is fundamentally a comparison: to zero, to a reference line, to other data points, or to our expectations.
When making graphs, we should:
A scatterplot can display up to five variables easily:
A grid of plots adds two more dimensions!
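As a rough illustration, the sketch below maps five variables to x position, y position, colour, shape, and size, then adds two more through the rows and columns of a panel grid. The `mpg` dataset (bundled with ggplot2) and the particular variable choices are assumptions made for the example; a plot this dense is shown only to make the counting concrete.

```r
library(ggplot2)

# Five variables in one scatterplot:
# x position, y position, colour, shape, and size
p_five <- ggplot(mpg, aes(x = displ, y = hwy,
                          colour = drv, shape = fl, size = cty)) +
  geom_point(alpha = 0.6)

# Two more variables via a grid of panels: rows by year, columns by cyl
p_seven <- p_five + facet_grid(rows = vars(year), cols = vars(cyl))
p_seven
```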
ggplot2 implements the grammar of graphics:
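A minimal sketch of how the pieces fit together, assuming the bundled `mpg` dataset as example data: a plot starts from data and aesthetic mappings, and geoms are added as layers with `+`.

```r
library(ggplot2)

p_layers <- ggplot(data = mpg,                         # data: a data frame
                   mapping = aes(x = displ, y = hwy)) + # aesthetics: variables mapped to visual properties
  geom_point() +                                        # geom: draw one point per row
  geom_smooth(method = "lm", formula = y ~ x)           # a second layer, added with +
p_layers
```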
Key Components
Common Geoms
geom_bar(): Bar charts
geom_point(): Scatterplots
geom_line(): Line plots
geom_histogram(): Histograms
geom_boxplot(): Boxplots

Bar charts are ideal when you have a categorical variable that you want to focus on.
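A minimal bar chart sketch, using the `class` column of the bundled `mpg` dataset as a stand-in categorical variable (the dataset is an assumption for illustration):

```r
library(ggplot2)

# geom_bar() counts the rows in each category of class
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()
p_bar
```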
Add a second variable using fill:
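For example, assuming the bundled `mpg` dataset, mapping `drv` to `fill` splits each bar into stacked segments:

```r
library(ggplot2)

# Bars are stacked by default, one coloured segment per drive type
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()
p_fill
```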
Use position = "dodge2" for side-by-side comparison:
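A sketch with the same assumed `mpg` example data. Dodging places the segments next to each other so their heights can be compared directly:

```r
library(ggplot2)

# Side-by-side bars within each class
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge2")
p_dodge

# Optional: keep every bar the same width even when a group is missing
p_dodge_single <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = position_dodge2(preserve = "single"))
```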
geom_bar()
geom_col()
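The difference between the two: `geom_bar()` counts rows for you, while `geom_col()` draws bar heights from a `y` value you supply. A small sketch, assuming the bundled `mpg` data and a hypothetical pre-counted table `class_counts`:

```r
library(ggplot2)

# geom_bar(): raw data in, counts computed by ggplot2
p_from_rows <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col(): counts computed beforehand and mapped to y
class_counts <- as.data.frame(table(class = mpg$class), responseName = "n")
p_from_counts <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col()

p_from_rows
p_from_counts
```

Both plots show the same bars; which one to use depends on whether the data are already summarised.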
Scatterplots show the relationship between two continuous variables.
Expert Advice
“A scatterplot may not always be the best choice, but it is rarely a bad one.” — Weissgerber et al. (2015)
Some consider it the most versatile and useful graph option.
Two strategies for overlapping points:
Transparency (alpha)
Jitter
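Both strategies in a short sketch, assuming the bundled `mpg` data, where many cars share the same integer `cty` and `hwy` values. The alpha and jitter amounts are illustrative choices.

```r
library(ggplot2)

# Transparency: overlapping points build up into darker marks
p_alpha <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.3)

# Jitter: add a little random noise so coincident points separate
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)

p_alpha
p_jitter
```

Jitter changes the plotted positions slightly, so keep the amount small relative to the spacing of the data.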
Line plots are ideal when data points should be connected, typically for:
Use geom_step() to emphasise discrete changes:
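A sketch of both geoms, assuming the `economics` dataset bundled with ggplot2. The unemployment series stands in for whatever series is being plotted; it is not an especially step-like variable and is used only to show the syntax.

```r
library(ggplot2)

# Connected line over the full monthly series
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Last 24 months drawn as steps: the value holds until the next observation
recent <- tail(economics, 24)
p_step <- ggplot(recent, aes(x = date, y = unemploy)) +
  geom_step()

p_line
p_step
```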
Histograms show the distribution of a continuous variable by:
The number of bins affects interpretation:
Note
Too few bins = too much smoothing. Too many bins = too much noise.
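To see the effect, the sketch below draws the same variable with three bin counts. The `mpg` data and the specific values 5, 20, and 100 are assumptions chosen for illustration.

```r
library(ggplot2)

p_few  <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 5)    # heavily smoothed
p_mid  <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)   # a middle ground
p_many <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 100)  # noisy

p_few
p_mid
p_many
```

Every version contains the same observations; only the grouping changes.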
Use geom_freqpoly() to compare groups:
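A frequency polygon draws the bin counts as lines, so several groups can share one panel without hiding each other. A sketch assuming the bundled `mpg` data and a bin width of 2:

```r
library(ggplot2)

# One line per drive type, all on the same axes
p_freq <- ggplot(mpg, aes(x = hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2)
p_freq
```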
An alternative view with stat_ecdf():
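The empirical cumulative distribution function shows, for each value, the proportion of observations at or below it, and it needs no bin choice. A sketch with the same assumed `mpg` example data:

```r
library(ggplot2)

# One ECDF curve per drive type; y runs from 0 to 1
p_ecdf <- ggplot(mpg, aes(x = hwy, colour = drv)) +
  stat_ecdf()
p_ecdf
```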
A boxplot displays five key statistics:
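In ggplot2 these are the median, the two hinges (first and third quartiles), and the two whisker ends, which reach the most extreme points within 1.5 times the interquartile range of the hinges. The sketch below, assuming the bundled `mpg` data, draws a boxplot and pulls the five computed values out of the plot for inspection.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()
p_box

# The five statistics ggplot2 computed for each box:
# lower whisker, first quartile, median, third quartile, upper whisker
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats
```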
Warning
The same boxplot can represent very different distributions!
Show the actual data alongside summary statistics:
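One common way to do this is to overlay jittered points on the boxplot. A sketch assuming the bundled `mpg` data; the jitter width and alpha are illustrative choices.

```r
library(ggplot2)

p_raw <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +                  # hide outlier marks; every point is drawn below
  geom_jitter(width = 0.2, height = 0, alpha = 0.4)   # spread points sideways only, y values unchanged
p_raw
```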
ggplot2 includes several built-in themes:
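A theme is added to a plot like any other component. The sketch below applies three of the built-in themes to one base plot (the `mpg` example data is an assumption):

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_base + theme_minimal()
p_base + theme_bw()
p_base + theme_classic()

# Themes accept a base font size, useful for slides
p_large <- p_base + theme_minimal(base_size = 16)
p_large
```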
Use labs() to add context:
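A sketch of the main `labs()` arguments, assuming the bundled `mpg` data. The label text is made up for the example.

```r
library(ggplot2)

p_labelled <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  labs(
    title    = "Engine size and fuel efficiency",
    subtitle = "Larger engines tend to use more fuel",
    x        = "Engine displacement (litres)",
    y        = "Highway miles per gallon",
    colour   = "Drive type",
    caption  = "Source: mpg dataset in ggplot2"
  )
p_labelled
```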
Split your plot by a categorical variable:
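A sketch using `facet_wrap()`, assuming the bundled `mpg` data, with one panel per vehicle class:

```r
library(ggplot2)

p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class)   # panels wrap into rows automatically
p_wrap
```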
Create a two-dimensional grid of panels:
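A sketch using `facet_grid()`, assuming the bundled `mpg` data, with drive type in rows and number of cylinders in columns:

```r
library(ggplot2)

p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv), cols = vars(cyl))
p_grid
```

Combinations that do not occur in the data appear as empty panels, which is itself informative.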
RColorBrewer
Viridis (colour-blind friendly)
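Both palette families are available through ggplot2 scale functions. A sketch, assuming the bundled `mpg` data; the `"Set2"` palette is an arbitrary choice among the ColorBrewer palettes.

```r
library(ggplot2)

p_colour <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# ColorBrewer palette for a categorical variable
p_brewer <- p_colour + scale_colour_brewer(palette = "Set2")

# Viridis palette for a categorical variable (_d = discrete)
p_viridis <- p_colour + scale_colour_viridis_d()

# Viridis palette for a continuous variable (_c = continuous)
p_viridis_c <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_viridis_c()

p_brewer
p_viridis
p_viridis_c
```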
Combine multiple plots with patchwork:
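A sketch of the patchwork operators, assuming the `patchwork` package is installed and using three small plots of the bundled `mpg` data (the plots and the title text are illustrative):

```r
# Assumes the patchwork package is installed: install.packages("patchwork")
library(ggplot2)
library(patchwork)

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p2 <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)
p3 <- ggplot(mpg, aes(x = class)) + geom_bar()

side_by_side <- p1 + p2   # place plots next to each other
stacked      <- p1 / p2   # stack plots vertically

# Two plots on top, one full-width plot below, with a shared title
combined <- (p1 | p2) / p3 +
  plot_annotation(title = "Three views of the mpg data")
combined
```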
Design for Your Audience
The success of a graph depends on how much information is lost in the encoding-decoding process.
Don’t
Do
Reporting Numbers
Don’t report numbers to too many decimal places. Display precision that respects the uncertainty in your data.
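A short sketch of rounding an interval for reporting, using base R only. The vector name `ci` is chosen here for illustration.

```r
ci <- c(3.276, 6.410)

round(ci, 1)    # round to one decimal place
signif(ci, 2)   # keep two significant digits

# Format the interval as text for a report
ci_text <- sprintf("[%.1f, %.1f]", ci[1], ci[2])
ci_text
```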
[3.276, 6.410] → Better written as [3.3, 6.4]
options(digits = 2) sets the number of significant digits R prints by default

Always plot your data: summary statistics can be misleading
All graphs are comparisons — design to make key comparisons clear
Choose appropriate geoms:
Customise thoughtfully — themes, labels, colours, and facets
Week 5: Data Cleaning and Probability Simulation